Documentation Improvements #745
base: main
Conversation
readme_misc.md
What is this?
This was my read_me with grammatical mistakes. I will remove it from .gitignore.
```bash
python scripts/download_checkpoints.py checkpoints/official/OLMo-1B.csv --save-dir ./checkpoints/ --step 2000
```
**Note**: All checkpoints in `checkpoints/official/` are unsharded files.
Suggested change: replace "All checkpoints in `checkpoints/official/` are unsharded files." with "All checkpoints in `checkpoints/official/` are unsharded."
They aren't just files. Even in unsharded format, a checkpoint still consists of multiple files.
```bash
torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml --load_path=https://olmo-checkpoints.org/ai2-llm/olmo-small/w1r5xfzt/step1000-unsharded
torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml --load_path=checkpoints/step2000 --save_folder=./new_checkpoints --run_name=olmo_test --save_overwrite
```
Suggested change (drop `--save_overwrite`):
`torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml --load_path=checkpoints/step2000 --save_folder=./new_checkpoints --run_name=olmo_test`
Without `--save_overwrite`, the program throws an error.
Only if the directory already exists.
We provide tools to do this, but first you'll need to download the data as above (unless you have an R2 API key) and update the corresponding config accordingly.
Then take note of the URL of the data order file you want, which can be found in the [Models Overview](#models-overview) table. For example, the data order file for the first epoch of the OLMo-7B model is [https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy](https://olmo-checkpoints.org/ai2-llm/olmo-small/46zc5fly/train_data/global_indices.npy).
To inspect the exact tokens used in training batches for OLMo models, first download the training data. If you don't have an R2 API key, use the public HTTP URLs and update your configuration file with the local data paths. After completing this setup, you can use the inspection tools to examine the training batches.
Nobody external would ever have an R2 key. I think we can skip that part of the instructions.
```python
except requests.exceptions.RequestException:
    continue
```
Why would you swallow these exceptions?
Oh, because you don't expect all files to be there? Then at least catch only 404 errors.
But better yet, list the contents of the directory in one call to check what's there, instead of making six calls every time we have to check a directory.
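A sketch of both options, using `requests` as the script already does. The file names, helper names, and checkpoint layout here are hypothetical, not the script's actual code:

```python
import requests

# Hypothetical file set for one unsharded checkpoint directory.
EXPECTED_FILES = ["config.yaml", "model.pt", "optim.pt", "train.pt"]

def missing_files(listing, expected=EXPECTED_FILES):
    """Given one directory listing, report which expected files are absent.

    This replaces N per-file HTTP probes with a single listing call."""
    available = set(listing)
    return [name for name in expected if name not in available]

def url_exists(url):
    """Probe one URL, treating only HTTP 404 as 'not there'.

    Any other failure (500, timeout, auth) is re-raised instead of
    being silently swallowed."""
    try:
        resp = requests.head(url, timeout=10)
        resp.raise_for_status()
        return True
    except requests.exceptions.HTTPError as err:
        if err.response is not None and err.response.status_code == 404:
            return False
        raise
```

With the listing approach, a whole directory is validated in one round-trip; `url_exists` is the fallback when no listing endpoint is available.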
```python
parser.add_argument('--save-dir', type=str, default='./checkpoints',
                    help='Base directory to save downloaded checkpoints')
parser.add_argument('--step', type=str, help='Specific step number to download (optional)')
parser.add_argument('--list-steps', action='store_true', help='List available step numbers and exit')
```
If you have a tool that can perform multiple different actions, use subcommands.
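For illustration, a minimal `argparse` subcommand skeleton along those lines. The command and option names are suggestions, not the script's current interface:

```python
import argparse

def build_parser():
    """Build a CLI with one subcommand per action instead of mode flags."""
    parser = argparse.ArgumentParser(description="Manage OLMo checkpoint downloads")
    sub = parser.add_subparsers(dest="command", required=True)

    # `list-steps` only inspects; it never writes anything.
    list_cmd = sub.add_parser("list-steps", help="List available step numbers")
    list_cmd.add_argument("csv_path", help="Checkpoint CSV, e.g. checkpoints/official/OLMo-1B.csv")

    # `download` fetches exactly one step.
    dl = sub.add_parser("download", help="Download one checkpoint step")
    dl.add_argument("csv_path")
    dl.add_argument("--step", required=True, help="Step number to download")
    dl.add_argument("--save-dir", default="./checkpoints")
    return parser
```

Each subcommand then gets its own required arguments, so `--step` can be mandatory for `download` without being meaningless for `list-steps`.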
```python
proceed = input("\nDo you want to proceed with the download? (y/n): ")
if proceed.lower() != 'y':
    print("Download cancelled.")
    return
```
No, we don't ask for permission; the tools just do the thing. What if we want to script it?
However, that means we have to make sure the tools never do anything dangerous by accident.
```python
for step, url in urls:
    save_path = os.path.join(args.save_dir, f"step{step}")
    try:
        download_checkpoint(url, save_path)
    except Exception as e:
        print(f"Error during download of step {step}: {e}")
```
Do you think anyone will ever want to download all steps? That's a lot of data. I think it's better if we give one command to list steps, and another to download one step, and let them deal with the rest.
```python
# checkpoint_type = (
#     CheckpointType.sharded if cfg.save_num_checkpoints_to_keep != 0 else CheckpointType.unsharded
# )
checkpoint_type = CheckpointType.unsharded
```
What's this?
```diff
-sharded_checkpointer=cfg.load_path_sharded_checkpointer,
+# sharded_checkpointer=cfg.load_path_sharded_checkpointer,
+sharded_checkpointer=False,
+checkpoint_type=CheckpointType.unsharded
```
Same question here
Changes Made
- Added `scripts/download_checkpoints.py` to automate checkpoint downloads.
- Changes to `scripts/train.py`.

New Features

The new `scripts/download_checkpoints.py` script: